flowchart LR
A["📊 Input Data"] --> B["🔬 Variable\nSelection"]
B --> C["📋 Data\nPreparation"]
C --> D["🔍 Subgroup\nSearch"]
D --> E["✅ Consistency\nEvaluation"]
E --> F["📈 Output"]
style A fill:#E3F2FD,stroke:#1976D2,stroke-width:2px
style B fill:#E8F4FD,stroke:#2E86AB,stroke-width:2px
style C fill:#FFF3E0,stroke:#F57C00,stroke-width:2px
style D fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
style E fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
style F fill:#F3E5F5,stroke:#7B1FA2,stroke-width:2px
ForestSearch Package Architecture
Function Relationships and Module Organization
1 Overview
ForestSearch is a comprehensive R package for exploratory subgroup identification in clinical trials with survival endpoints. The package implements advanced methods from León et al. (2024) published in Statistics in Medicine.
Key Methods:
- Generalized Random Forests (GRF) for variable importance
- LASSO regularization for dimension reduction
- Exhaustive combinatorial search for subgroup discovery
- Split-sample validation for consistency assessment
- Bootstrap bias correction using infinitesimal jackknife methods
2 High-Level Workflow
The ForestSearch algorithm proceeds through five main phases:
After the core analysis, optional validation steps include:
flowchart LR
A["✅ ForestSearch\nResult"] --> B["🔄 Bootstrap\nBias Correction"]
A --> C["📈 K-Fold\nCross-Validation"]
B --> D["📊 Final\nReport"]
C --> D
style A fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
style B fill:#E8F4FD,stroke:#2E86AB,stroke-width:2px
style C fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
style D fill:#FFFDE7,stroke:#F9A825,stroke-width:2px
3 Phase 1: Main Entry Point
3.1 forestsearch() - The Orchestrator
Location: R/forest_search_revised.R
flowchart TB
FS[["forestsearch()"]]
FS --> P1["1. Validate Inputs"]
FS --> P2["2. GRF Variable Selection"]
FS --> P3["3. LASSO Selection"]
FS --> P4["4. Data Preparation"]
FS --> P5["5. Subgroup Search"]
FS --> P6["6. Consistency Evaluation"]
FS --> P7["7. Return Results"]
P1 --> P2
P2 --> P3
P3 --> P4
P4 --> P5
P5 --> P6
P6 --> P7
style FS fill:#2E86AB,stroke:#1a5276,stroke-width:3px,color:#fff
Key Parameters:
| Parameter | Description | Default |
|---|---|---|
df.analysis |
Analysis dataset | Required |
confounders.name |
Candidate variables | NULL |
use_lasso |
Enable LASSO | TRUE |
use_grf |
Enable GRF | TRUE |
hr.threshold |
Min HR for candidates | 1.25 |
pconsistency.threshold |
Consistency requirement | 0.90 |
fs.splits |
Validation splits | 1000 |
maxk |
Max factors per subgroup | 2 |
4 Phase 2: Variable Selection
4.1 GRF-Based Selection
Function: grf.subg.harm.survival() Location: R/grf_main.R
flowchart TB
subgraph GRF["GRF Variable Selection"]
A["Input: Data + Confounders"] --> B["fit_causal_forest()"]
B --> C["Compute Variable\nImportance"]
C --> D["Fit Policy Trees\n(depth 1, 2, 3)"]
D --> E["select_best_subgroup()"]
E --> F["extract_all_tree_cuts()"]
F --> G["Output: GRF Cuts"]
end
style A fill:#E3F2FD,stroke:#1976D2
style G fill:#E8F5E9,stroke:#388E3C
GRF Helper Functions:
| Function | Purpose |
|---|---|
create_grf_config() |
Initialize parameters |
fit_causal_forest() |
Fit causal survival forest |
compute_node_metrics() |
Metrics per tree node |
select_best_subgroup() |
Choose optimal subgroup |
extract_all_tree_cuts() |
Get cut expressions |
4.2 LASSO-Based Selection
Function: lasso_selection() Location: R/get_FSdata_helpers.R
flowchart TB
subgraph LASSO["LASSO Variable Selection"]
A["Input: Confounders"] --> B["cv.glmnet()\nCox-LASSO"]
B --> C["Select lambda.min"]
C --> D["Extract Coefficients"]
D --> E{"Coef ≠ 0?"}
E -->|Yes| F["Selected Variables"]
E -->|No| G["Omitted Variables"]
end
style A fill:#E3F2FD,stroke:#1976D2
style F fill:#E8F5E9,stroke:#388E3C
style G fill:#FFEBEE,stroke:#C62828
5 Phase 3: Data Preparation
5.1 get_FSdata() - Creating Binary Indicators
Location: R/get_FSdata_refactored.R
flowchart TB
subgraph DataPrep["Data Preparation Pipeline"]
A["Raw Data"] --> B["Classify Variables"]
B --> C{"Continuous?"}
C -->|Yes| D["Create Cut\nIndicators"]
C -->|No| E["Create Dummy\nVariables"]
D --> F["Apply LASSO\nFilter"]
E --> F
F --> G["Apply GRF\nCuts"]
G --> H["Handle Forced\nCuts"]
H --> I["Binary Indicator\nMatrix Z"]
end
style A fill:#E3F2FD,stroke:#1976D2
style I fill:#E8F5E9,stroke:#388E3C
Data Preparation Helpers:
| Function | Purpose |
|---|---|
is.continuous() |
Check variable type |
dummy() |
Create dummy variables |
get_conf_force() |
Generate forced cuts |
evaluate_cuts_once() |
Evaluate & cache cuts |
FS_labels() |
Convert q-codes to labels |
filter_by_lassokeep() |
Apply LASSO filter |
6 Phase 4: Subgroup Search
6.1 subgroup.search() - Exhaustive Combinatorial Search
Location: R/subgroup_search.R
flowchart TB
subgraph Search["Subgroup Search Algorithm"]
A["Generate All\nCombinations\n(up to maxk)"] --> B["Loop: Each\nCombination"]
B --> C{"Prevalence\n≥ minp?"}
C -->|No| B
C -->|Yes| D{"Size\n≥ n.min?"}
D -->|No| B
D -->|Yes| E{"Events\nd0 ≥ d0.min\nd1 ≥ d1.min?"}
E -->|No| B
E -->|Yes| F["Fit Cox Model"]
F --> G{"HR >\nthreshold?"}
G -->|No| B
G -->|Yes| H["Store\nCandidate"]
H --> B
B -->|Done| I["Sort by HR\nReturn Candidates"]
end
style A fill:#E3F2FD,stroke:#1976D2
style H fill:#FFF3E0,stroke:#F57C00
style I fill:#E8F5E9,stroke:#388E3C
Search Helper Functions:
| Function | Purpose |
|---|---|
prepare_search_data() |
Clean data |
generate_combination_indices() |
Create k-combinations |
search_combinations() |
Main search loop |
evaluate_combination() |
Test one combination |
get_covs_in() |
Factor indicators |
get_subgroup_membership() |
Membership vector |
format_search_results() |
Format as data.table |
7 Phase 5: Consistency Evaluation
7.1 subgroup.consistency() - Split-Sample Validation
Location: R/subgroup_consistency_main.R
Two algorithms are available:
7.2 Fixed Algorithm (Default)
flowchart TB
subgraph Fixed["Fixed Consistency Algorithm"]
A["Candidate\nSubgroup"] --> B["Repeat n.splits times:"]
B --> C["Random 50/50\nSplit"]
C --> D["Compute HR\nin Split 1"]
C --> E["Compute HR\nin Split 2"]
D --> F{"Both HR >\nhr.consistency?"}
E --> F
F -->|Yes| G["Count as\nConsistent"]
F -->|No| H["Count as\nInconsistent"]
G --> B
H --> B
B -->|Done| I["Pcons =\nConsistent / Total"]
end
style A fill:#E3F2FD,stroke:#1976D2
style I fill:#E8F5E9,stroke:#388E3C
7.3 Two-Stage Algorithm (Optional)
flowchart TB
subgraph TwoStage["Two-Stage Consistency Algorithm"]
A["Candidate"] --> B["Stage 1:\nQuick Screen\n(30 splits)"]
B --> C{"Pcons >\nscreen.threshold?"}
C -->|No| D["REJECT"]
C -->|Yes| E["Stage 2:\nBatched Evaluation"]
E --> F["Run batch of\n20 splits"]
F --> G["Compute Wilson CI"]
G --> H{"CI lower >\nthreshold?"}
H -->|Yes| I["ACCEPT"]
H -->|No| J{"CI upper <\nthreshold?"}
J -->|Yes| D
J -->|No| E
end
style A fill:#E3F2FD,stroke:#1976D2
style D fill:#FFEBEE,stroke:#C62828
style I fill:#E8F5E9,stroke:#388E3C
Consistency Helper Functions:
| Function | Purpose |
|---|---|
evaluate_subgroup_consistency() |
Fixed algorithm |
evaluate_consistency_twostage() |
Two-stage algorithm |
get_split_hr_fast() |
Fast HR for split |
wilson_ci() |
Wilson confidence interval |
early_stop_decision() |
Early stopping logic |
sort_subgroups() |
Sort by sg_focus |
extract_subgroup() |
Get subgroup definition |
8 Phase 6: Bootstrap Bias Correction
8.1 forestsearch_bootstrap_dofuture()
Location: R/bootstrap_dofuture_main.R
flowchart TB
subgraph Bootstrap["Bootstrap Bias Correction"]
A["Original\nResult"] --> B["Generate B\nBootstrap Samples"]
B --> C["For each bootstrap:"]
C --> D["Run forestsearch()\non bootstrap data"]
D --> E["Compute HRs:\nH*, H*_obs, H*_*"]
E --> C
C -->|Done| F["Aggregate Results"]
F --> G["Compute Bias\nCorrection"]
G --> H["Adjusted HR\nEstimate"]
end
style A fill:#E3F2FD,stroke:#1976D2
style H fill:#E8F5E9,stroke:#388E3C
Bias Correction Formulas:
| Method | Formula |
|---|---|
| Simple Optimism | \(H_{adj1} = H_{obs} - (H^*_* - H^*_{obs})\) |
| Double Bootstrap | \(H_{adj2} = 2H_{obs} - (H_* + H^*_* - H^*_{obs})\) |
Bootstrap Helper Functions:
| Function | Location |
|---|---|
bootstrap_results() |
Coordinate iterations |
run_single_bootstrap() |
One bootstrap iteration |
summarize_bootstrap_results() |
Aggregate statistics |
summarize_bootstrap_subgroups() |
Subgroup agreement |
9 Phase 7: Cross-Validation
9.1 forestsearch_Kfold()
Location: R/forestsearch_cross-validation.R
flowchart TB
subgraph CV["K-Fold Cross-Validation"]
A["Full Dataset"] --> B["Split into\nK Folds"]
B --> C["For fold k = 1 to K:"]
C --> D["Train: K-1 folds"]
C --> E["Test: fold k"]
D --> F["forestsearch()\non training"]
F --> G["Evaluate on\ntest fold"]
G --> C
C -->|Done| H["forestsearch_KfoldOut()\nAggregate Results"]
end
style A fill:#E3F2FD,stroke:#1976D2
style H fill:#E8F5E9,stroke:#388E3C
CV Helper Functions:
| Function | Purpose |
|---|---|
forestsearch_KfoldOut() |
Summarize K-fold results |
forestsearch_tenfold() |
Repeated K-fold wrapper |
find_covariate_any_match() |
Match covariates across folds |
10 Phase 8: Output & Visualization
10.1 S3 Methods and Summary Functions
flowchart TB
subgraph Output["Output Functions"]
A["forestsearch\nresult"] --> B["print.forestsearch()"]
A --> C["summary.forestsearch()"]
A --> D["sg_tables()"]
A --> E["plot_subgroup_results_forestplot()"]
B --> F["Console\nSummary"]
C --> G["Detailed\nStatistics"]
D --> H["gt Tables"]
E --> I["Forest Plot"]
end
style A fill:#E3F2FD,stroke:#1976D2
style F fill:#FFFDE7,stroke:#F9A825
style G fill:#FFFDE7,stroke:#F9A825
style H fill:#FFFDE7,stroke:#F9A825
style I fill:#FFFDE7,stroke:#F9A825
Output Functions:
| Function | Purpose |
|---|---|
print.forestsearch() |
Basic summary |
summary.forestsearch() |
Detailed stats |
sg_tables() |
gt-formatted tables |
plot_subgroup_results_forestplot() |
Forest plots |
cox_cs_fit() |
Spline Cox models |
filter_call_args() |
Filter function args |
11 Simulation & DGM Module
11.1 Data Generation for Simulations
flowchart TB
subgraph Simulation["Simulation Workflow"]
A["Define DGM\nParameters"] --> B["generate_aft_dgm_flex()"]
B --> C["simulate_from_dgm()"]
C --> D["Simulated\nTrial Data"]
D --> E["run_forestsearch_analysis()"]
E --> F["Evaluate\nPerformance"]
end
style A fill:#E3F2FD,stroke:#1976D2
style D fill:#FFF3E0,stroke:#F57C00
style F fill:#E8F5E9,stroke:#388E3C
Simulation Functions:
| Function | Location | Purpose |
|---|---|---|
generate_aft_dgm_flex() |
generate_aft_dgm_flex.R |
Create AFT DGM |
simulate_from_dgm() |
simulate_from_dgm.R |
Generate data |
run_mrct_simulation() |
mrct_simulation.R |
MRCT simulation |
run_single_simulation() |
oc_analyses_gbsg.R |
Single iteration |
12 Parallel Processing
ForestSearch supports multiple parallel backends:
flowchart LR
subgraph Backends["Parallel Backends"]
A["sequential"] --> E["Single Thread"]
B["multisession"] --> F["Background R Sessions"]
C["multicore"] --> G["Fork Processes\n(Unix/Mac)"]
D["callr"] --> H["Separate R Processes"]
end
style A fill:#ECEFF1,stroke:#546E7A
style B fill:#E8F4FD,stroke:#2E86AB
style C fill:#E8F5E9,stroke:#388E3C
style D fill:#FFF3E0,stroke:#F57C00
13 Package Dependencies
flowchart TB
subgraph Stats["Core Statistical"]
survival
grf
glmnet
policytree
end
subgraph Data["Data Handling"]
data.table
stringr
end
subgraph Par["Parallel"]
future
doFuture
foreach
callr
end
subgraph Viz["Visualization"]
ggplot2
gt
forestploter
end
FS["forestsearch"] --> Stats
FS --> Data
FS --> Par
FS --> Viz
style FS fill:#2E86AB,stroke:#1a5276,color:#fff
14 File Organization
R/
├── forest_search_revised.R # Main forestsearch()
├── get_FSdata_refactored.R # Data preparation
├── get_FSdata_helpers.R # Data prep helpers
├── subgroup_search.R # Exhaustive search
├── subgroup_consistency_main.R # Consistency evaluation
├── subgroup_consistency_helpers.R # Consistency helpers
├── grf_main.R # GRF subgroup ID
├── grf_helpers.R # GRF helpers
├── bootstrap_dofuture_main.R # Bootstrap main
├── bootstrap_analysis_dofuture.R # Bootstrap iteration
├── bootstrap_summaries_helpers.R # Bootstrap summary
├── summarize_bootstrap_subgroups.R # Subgroup summary
├── forestsearch_cross-validation.R # K-fold CV
├── summary_utility_functions.R # Output utilities
├── summary_forestsearch.R # S3 summary
├── plot_subgroup_results_forestplot.R # Forest plots
├── cox_spline_fit.R # Spline Cox
├── generate_aft_dgm_flex.R # DGM generation
├── simulate_from_dgm.R # Simulation
├── mrct_simulation.R # MRCT sim
└── oc_analyses_gbsg.R # OC analyses
15 Typical Workflow Example
# 1. Run main analysis
fs_result <- forestsearch(
df.analysis = trial_data,
confounders.name = c("age", "biomarker", "stage"),
hr.threshold = 1.25,
pconsistency.threshold = 0.90,
use_grf = TRUE,
use_lasso = TRUE,
details = TRUE
)
# 2. Bootstrap bias correction
fs_boot <- forestsearch_bootstrap_dofuture(
fs.est = fs_result,
nb_boots = 1000,
parallel_args = list(plan = "callr", workers = 6)
)
# 3. Cross-validation
fs_cv <- forestsearch_Kfold(
fs.est = fs_result,
Kfolds = 10
)
# 4. Summary and visualization
print(fs_result)
sg_tables(fs_result)16 References
- León LF, et al. (2024). Exploratory subgroup identification in the heterogeneous Cox model. Statistics in Medicine.
- Athey S, Imbens G (2016). Recursive partitioning for heterogeneous causal effects. PNAS.
- Wager S, Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA.
Document generated for ForestSearch R package - CRAN submission preparation